Expanding textual entailment corpora from Wikipedia using co-training
Abstract
In this paper we propose a novel method to automatically extract large textual entailment datasets homogeneous to existing ones. The key idea is the combination of two intuitions: (1) the use of Wikipedia to extract a large set of textual entailment pairs; (2) the application of semisupervised machine learning methods to make the extracted dataset homogeneous to the existing ones. We report empirical evidence that our method successfully expands existing textual entailment corpora.
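The abstract does not say which classifiers, feature views, or selection rule the authors use, so the following is only a generic co-training sketch under assumed choices, not the paper's method. It assumes each candidate pair has two feature views, uses a toy nearest-centroid classifier, and lets each view add its most confident unlabeled examples to the shared training set. All names (`NearestCentroid`, `co_train`, `per_round`) are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Example = Tuple[Vector, Vector]  # (view_a, view_b): two assumed feature views


def centroid(rows: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


class NearestCentroid:
    """Toy classifier (an assumption): predicts the label of the closest centroid."""

    def fit(self, xs: List[Vector], ys: List[int]) -> "NearestCentroid":
        self.centroids: Dict[int, Vector] = {
            label: centroid([x for x, y in zip(xs, ys) if y == label])
            for label in set(ys)
        }
        return self

    def predict_with_margin(self, x: Vector) -> Tuple[int, float]:
        # Confidence proxy: gap between the farthest and nearest centroid distances.
        dists = sorted((sq_dist(x, c), label) for label, c in self.centroids.items())
        return dists[0][1], dists[-1][0] - dists[0][0]


def co_train(
    labeled: List[Example],
    labels: List[int],
    unlabeled: List[Example],
    rounds: int = 5,
    per_round: int = 1,
) -> Tuple[List[Example], List[int]]:
    """Grow the labeled set by letting each view label its most confident examples."""
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked: Dict[int, int] = {}  # pool index -> predicted label
        for view in (0, 1):
            clf = NearestCentroid().fit([ex[view] for ex in labeled], labels)
            scored = sorted(
                ((clf.predict_with_margin(ex[view]), i) for i, ex in enumerate(pool)),
                key=lambda item: item[0][1],
                reverse=True,
            )
            for (label, _margin), i in scored[:per_round]:
                picked.setdefault(i, label)  # first view to claim an example wins
        # Remove from the highest index down so earlier indices stay valid.
        for i in sorted(picked, reverse=True):
            labeled.append(pool.pop(i))
            labels.append(picked[i])
    return labeled, labels


# Made-up data: label 1 stands for "entailment", 0 for "no entailment".
seed = [([1.0, 1.0], [0.9]), ([0.0, 0.0], [0.1])]
seed_labels = [1, 0]
candidates = [([0.9, 0.8], [0.8]), ([0.1, 0.2], [0.2]),
              ([0.8, 1.0], [1.0]), ([0.2, 0.0], [0.0])]
expanded, expanded_labels = co_train(seed, seed_labels, candidates)
```

In this sketch the seed set plays the role of an existing corpus and the candidates stand in for automatically extracted pairs; a real system would replace the toy classifier and hand-made vectors with proper entailment features.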
Similar resources
Detecting Cross-Lingual Semantic Divergence for Neural Machine Translation
Parallel corpora are often not as parallel as one might assume: non-literal translations and noisy translations abound, even in curated corpora routinely used for training and evaluation. We use a cross-lingual textual entailment system to distinguish sentence pairs that are parallel in meaning from those that are not, and show that filtering out divergent examples from training improves transl...
The Description of the NTOU RITE System in NTCIR-9
The textual entailment system determines whether one sentence can entail another in a common sense. We proposed several approaches to train textual entailment classifiers, including setting ancestor distance threshold, expanding training corpus, using different sets of features, and tuning classifier settings. The results show that a MC classifier trained by using an expanded training corpus an...
Automatic Building and Using Parallel Resources for SMT from Comparable Corpora
Building parallel resources for corpus-based machine translation, especially Statistical Machine Translation (SMT), from comparable corpora has recently received wide attention in the field of Machine Translation research. In this paper, we propose an automatic approach for extraction of parallel fragments from comparable corpora. The comparable corpora are collected from Wikipedia documents and t...
A Preliminary Study of Finding Entailing Texts in a Domain-specific Monolingual Parallel Corpora
This paper introduces the possible usages, benefits, and challenges involved in the use of domain-specific monolingual parallel corpora in determining textual entailment (TE). A system that finds entailing text for a given statement is to be developed using monolingual parallel translations of the Bible as corpus as this is one of the most accessible monolingual parallel corpora. Different exis...
Recognizing Paraphrases And Textual Entailment Using Inversion Transduction Grammars
We present first results using paraphrase as well as textual entailment data to test the language universal constraint posited by Wu’s (1995, 1997) Inversion Transduction Grammar (ITG) hypothesis. In machine translation and alignment, the ITG Hypothesis provides a strong inductive bias, and has been shown empirically across numerous language pairs and corpora to yield both efficiency and accura...